Goto

Collaborating Authors

 max 2


Results

Neural Information Processing Systems

For any > 0, the -covering number of the Euclidean ball Bd(R):= {x 2Rd: kxk2 R} with radius R> 0 in the Euclidean metric is upper bounded by (1+2R/)d. Let F0 F 1 ... FT be a filtration and let X1,X2,...,XT be real random variables such that Xt is Ft-measurable, E[Xt|Ft 1]=0, |Xt| balmost surely, and PT t=1 E[X2t |Ft 1] V for some fixed V> 0and b> 0. Then for any 2(0,1), we have with probability at least 1, For any linear MDP satisfying Definition 3.1, we must have that k (s,a)k2 1/ p d for all s and a, and k,hk2 1/ p d for all and h. By Definition 3.1, we know that Ph( |s,a)= h (s,a),ยตh()i forms a valid probability distribution, and that k R S |dยตh(s)|k2 p d. This yields the first equality. Repeating this calculation h 1more times yields the final equality. Lemma A.8. Fix some h and i (s,a)| 1, and kvk2 p d. Proof. By the linear MDP structure (see Proposition 2.3 of Jin et al. (2020)), for any j, Q j (s,a)= h (s,a),w j i = h (s,a), ji+ Z We first consider the case where u = h for some h which is a valid reward satisfying Definition 3.1. Assume that the reward in our MDP is set such that for h0 6= h, h0 =0 .


Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via Online Experiment Design

Neural Information Processing Systems

While much progress has been made in understanding the minimax sample complexity of reinforcement learning (RL)--the complexity of learning on the "worstcase" instance--such measures of complexity often do not capture the true difficulty of learning. In practice, on an "easy" instance, we might hope to achieve a complexity far better than that achievable on the worst-case instance. In this work we seek to understand the "instance-dependent" complexity of learning near-optimal policies (PACRL) in the setting of RL with linear function approximation. We propose an algorithm, PEDEL, which achieves a fine-grained instance-dependent measure of complexity, the first of its kind in the RL with function approximation setting, thereby capturing the difficulty of learning on each particular problem instance. Through an explicit example, we show that PEDEL yields provable gains over low-regret, minimax-optimal algorithms and that such algorithms are unable to hit the instance-optimal rate. Our approach relies on a novel online experiment design-based procedure which focuses the exploration budget on the "directions" most relevant to learning a near-optimal policy, and may be of independent interest.



161882dd2d19c716819081aee2c08b98-Supplemental.pdf

Neural Information Processing Systems

We restate the theoretical statements and the algorithms here for completeness and convenience. Given a normalized monotone submodular function f:2 V! The objective aims to find the partition such that the minimum-valued block in the partition is maximized. For a ground set V and its elements (v1,v2,...,v n) coming in an arbitrary streaming order, the output solution of Alg. 1 has The output solution of Alg. 2 has We construct a set cover function as the tight example for Corollary 1. We illustrate the set cover function graphically in Figure 1. The circles are the areas to cover for the set cover function and the green inner circles and the red triangles are elements in the ground set (the outer yellow circles are not elements). The inner circles (green) largely overlap with the outer circles (yellow).